fix(journal): serialize concurrent appends via lockfile#1002

Open
carlos-alm wants to merge 6 commits into main from fix/journal-atomic-appends

Conversation

@carlos-alm
Contributor

Summary

Fixes #996.

appendJournalEntries() relied on fs.appendFileSync alone — that's atomic within a single call but not across concurrent writers. When a watch session and a manual codegraph build in another shell both write to .codegraph/changes.journal, their lines can interleave, producing truncated DELETED prefixes, partial paths, or a \n-less tail. A corrupted header then makes readJournal return valid: false and fall back silently to a hash scan — a silent perf regression, not a crash.

Fix

Wrap both appendJournalEntries and writeJournalHeader in withJournalLock:

  • Acquires .codegraph/changes.journal.lock with fs.openSync(path, 'wx') — atomic exclusive-create on ext4/NTFS/APFS.
  • Stamps the holder PID inside the lockfile.
  • Retries every 25ms up to a 5s timeout.
  • Steals the lock if the holder PID is dead (process.kill(pid, 0)) or the file's mtime is older than 30s — covers crash-mid-write.
  • Releases the fd + unlinks the lockfile in finally.

Zero new dependencies. Atomics.wait on a SharedArrayBuffer provides synchronous retry sleep so the existing sync API is preserved (watcher and build stages call these functions synchronously).
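Putting the bullets above together, the acquire loop might look like the sketch below. The constant names and exact control flow are assumptions for illustration, not the merged code — and the naive unlink-based steal and unguarded PID stamp shown here are precisely the parts the review threads later in this PR tighten up.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";
import * as os from "node:os";

// Assumed constants mirroring the numbers in the description above.
const LOCK_TIMEOUT_MS = 5_000;
const LOCK_RETRY_MS = 25;
const LOCK_STALE_MS = 30_000;

function sleepSync(ms: number): void {
  // Synchronous sleep without async/await; Node permits Atomics.wait
  // on the main thread (browsers do not).
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

function isPidAlive(pid: number): boolean {
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0); // signal 0: existence probe, delivers nothing
    return true;
  } catch (e) {
    // EPERM: the process exists but is owned by another user.
    return (e as NodeJS.ErrnoException).code === "EPERM";
  }
}

function withJournalLock<T>(lockPath: string, fn: () => T): T {
  const deadline = Date.now() + LOCK_TIMEOUT_MS;
  let fd: number | undefined;
  while (fd === undefined) {
    if (Date.now() > deadline) throw new Error(`timed out acquiring ${lockPath}`);
    try {
      fd = fs.openSync(lockPath, "wx"); // atomic exclusive-create
      fs.writeSync(fd, `${process.pid}\n`); // stamp holder PID
      break;
    } catch (e) {
      if ((e as NodeJS.ErrnoException).code !== "EEXIST") throw e;
    }
    try {
      const holderPid = Number(fs.readFileSync(lockPath, "utf-8").trim());
      const ageMs = Date.now() - fs.statSync(lockPath).mtimeMs;
      if (!isPidAlive(holderPid) || ageMs > LOCK_STALE_MS) {
        fs.unlinkSync(lockPath); // naive steal: racy with multiple stealers
        continue;
      }
    } catch {
      continue; // lockfile vanished between open and read: retry at once
    }
    sleepSync(LOCK_RETRY_MS);
  }
  try {
    return fn();
  } finally {
    fs.closeSync(fd!);
    try { fs.unlinkSync(lockPath); } catch { /* already removed */ }
  }
}
```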

Test plan

  • npx vitest run tests/unit/journal.test.ts — 20/20 pass (17 existing + 3 new: lockfile cleanup, dead-PID stale-lock stealing, no-interleave across 200 mixed appends)
  • npx vitest run tests/builder/detect-changes.test.ts tests/integration/build.test.ts tests/integration/watcher-rebuild.test.ts — 27/27 pass
  • npm run typecheck — clean

appendJournalEntries used fs.appendFileSync with no cross-process
coordination, so a watcher session and a manual codegraph build in a
second shell could interleave lines and corrupt .codegraph/changes.journal
(truncated DELETED prefixes, partial entries, newline-less tails).

Wrap both appendJournalEntries and writeJournalHeader in a
withJournalLock helper that:
- acquires .codegraph/changes.journal.lock via fs.openSync(path, 'wx')
  (atomic exclusive-create, cross-platform)
- stamps the holder PID for stale-lock detection
- retries every 25ms up to a 5s timeout
- steals the lock if the holder PID is dead or the lock file is older
  than 30s (covers crash-mid-write)
- always releases in finally

Zero new dependencies — uses only node:fs and Atomics.wait on a
SharedArrayBuffer for synchronous retry sleep, preserving the existing
sync API.

Fixes #996

Impact: 7 functions changed, 12 affected
@claude

claude Bot commented Apr 21, 2026

Claude finished @carlos-alm's task


PR Review: fix(journal): serialize concurrent appends via lockfile

Todo List

  • Read and analyze the modified journal implementation
  • Review test changes for integrity and coverage
  • Check for configuration changes or rule relaxations
  • Perform root cause analysis
  • Assess backlog compliance criteria
  • Provide final review assessment

Root Cause Analysis

Problem Being Solved: Race condition where concurrent journal writers (watch session + manual codegraph build) can interleave lines in .codegraph/changes.journal, causing corruption like truncated "DELETED " prefixes or paths without newlines. This leads to readJournal returning valid: false and falling back to expensive hash scans.

Why This Approach: Uses atomic file locking via fs.openSync(path, 'wx'), an OS-level exclusive-create operation. The approach is sound: it wraps both appendJournalEntries and writeJournalHeader in withJournalLock to serialize all journal writes.

Risk Assessment: LOW - Well-implemented defensive approach with proper timeout, dead PID detection, and stale lock cleanup.

Backlog Compliance

  • Zero-dep: ✓ — Uses only Node.js built-ins (fs, process.kill, SharedArrayBuffer, Atomics)
  • Foundation-aligned: ✓ — Solves core reliability issue without violating any principles
  • Problem-fit: 4 — Directly prevents AI agents from experiencing silent perf degradation due to corrupted journals
  • Breaking: No — Internal implementation change, no API changes
  • Tier: 1

Technical Review

Implementation Quality ⭐⭐⭐⭐☆

Strong Points:

  • Atomic locking: fs.openSync(path, 'wx') is the correct way to do exclusive file creation
  • Dead PID detection: process.kill(pid, 0) with proper EPERM handling (lines 17-25)
  • Stale lock cleanup: 30s timeout prevents indefinite blocking from crashed processes
  • Sync sleep: Clever use of SharedArrayBuffer + Atomics.wait for synchronous retry without async/await
  • Proper resource cleanup: finally block ensures fd close + lock unlink (lines 101-105)

Implementation Details:

  • Lock timeout: 5s (reasonable for file I/O)
  • Retry interval: 25ms (appropriate granularity)
  • Stale threshold: 30s (good balance)
  • Lock file contains PID for diagnostics (line 34)

Test Coverage ⭐⭐⭐⭐☆

New Tests Added:

  1. Lock cleanup verification (lines 200-207): Ensures .lock file is removed after successful operations
  2. Dead PID lock stealing (lines 209-223): Tests stale lock recovery using impossible PID (2147483646)
  3. Interleaving prevention (lines 225-243): 200 concurrent appends verify no corruption occurs

Test Quality:

  • Good edge case coverage (dead PID, lock cleanup, concurrent access)
  • Uses realistic scenario (200 mixed appends simulating high concurrency)
  • Proper assertion: regex validation that all lines match expected format
  • No test integrity issues: tests are additive and strengthen coverage

Potential Issues (Minor)

  1. SharedArrayBuffer compatibility: While widely supported, could theoretically fail in some environments. However, fallback would be a simple setTimeout with Promise, and this preserves the existing sync API.

  2. Lock file in same directory: Generally fine, though some filesystems might have issues with rapid create/delete cycles. The implementation handles races properly with try/catch blocks.

  3. No explicit file sync: fs.writeSync and fs.appendFileSync don't guarantee disk persistence before proceeding, but this shouldn't affect correctness for the lockfile use case.

Critical Concerns

None. This is a solid fix for a real concurrency issue.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: Well-implemented fix for a legitimate race condition. Uses appropriate OS primitives, has good test coverage, adds no dependencies, and solves a real reliability problem that affects AI agent performance. The implementation is defensive and handles edge cases properly.

The only reason this isn't 5 stars is that it's solving an existing issue rather than adding exceptional new value, but the execution is excellent.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 21, 2026

Greptile Summary

This PR serializes concurrent journal writers using an advisory lockfile (changes.journal.lock) with atomic exclusive-create, nonce-verified steal-via-rename, busy-spin retry, and stale-tmp sweeping. The three previously raised P1s (TOCTOU steal race, Atomics.wait event-loop freeze, empty-lockfile nonce vulnerability) are all addressed in the current HEAD. Two narrower P2 edge cases remain in the steal and writeSync-failure paths.

Confidence Score: 4/5

Safe to merge for typical usage; two residual P2 races only manifest under resource exhaustion (EMFILE) or I/O errors combined with concurrent writers.

All three prior P1s are resolved. Remaining findings are P2: a self-lockout under EMFILE when openSync('r+') fails after a successful steal rename, and an unguarded unlinkSync in the writeSync-failure handler that could delete a concurrent stealer's live lock under ENOSPC. Both require unusual OS-level failure modes on top of concurrent writers, so they don't block merge but are worth a follow-up fix.

src/domain/graph/journal.ts — trySteal (lines 84–91) and acquireJournalLock writeSync-failure handler (lines 106–115)

Important Files Changed

Filename — Overview
src/domain/graph/journal.ts — Adds withJournalLock with atomic steal-via-rename, nonce verification, busy-spin retry, and stale-tmp sweeping; wraps all three write exports. Two P2 edge cases: self-lockout when openSync('r+') fails after winning a steal, and unguarded unlinkSync in the writeSync-failure handler that can delete a concurrent stealer's live lock.
tests/unit/journal.test.ts — Adds five lock-specific tests: cleanup, dead-PID steal, TOCTOU nonce-verification regression, interleave-free appends, and stale-tmp sweep. Coverage is solid for the happy paths and the previously reviewed races.

Sequence Diagram

sequenceDiagram
    participant W as Watcher
    participant B as Build
    participant FS as Filesystem

    W->>FS: openSync('wx') → fd [EEXIST if B holds lock]
    W->>FS: writeSync(fd, PID+nonce)
    Note over W: acquireJournalLock ✓

    B->>FS: openSync('wx') → EEXIST
    B->>FS: readFileSync(lock) → W's PID
    B->>FS: isPidAlive(W.PID) → true
    B->>FS: statSync(lock) → mtime fresh
    Note over B: shouldSteal=false → sleepSync(25ms) → retry loop

    W->>FS: appendFileSync(journal, lines)
    Note over W: fn() executes

    W->>FS: readFileSync(lock) → nonce match
    W->>FS: unlinkSync(lock)
    Note over W: releaseJournalLock ✓

    B->>FS: openSync('wx') → fd [lock now free]
    B->>FS: writeSync(fd, PID+nonce)
    Note over B: acquireJournalLock ✓
    B->>FS: appendFileSync(journal, lines)
    B->>FS: unlinkSync(lock)
    Note over B: releaseJournalLock ✓

Reviews (5): Last reviewed commit: "fix(journal): sweep orphaned lockfile .t..."

Comment thread src/domain/graph/journal.ts Outdated
Comment on lines +43 to +57
let holderAlive = true;
try {
  const pidContent = fs.readFileSync(lockPath, 'utf-8').trim();
  holderAlive = isPidAlive(Number(pidContent));
} catch {
  /* unreadable — fall through to age check */
}

if (!holderAlive) {
  try {
    fs.unlinkSync(lockPath);
  } catch {
    /* another writer stole it first */
  }
  continue;
Contributor

P1 TOCTOU race allows two processes to concurrently enter the critical section

When multiple processes are waiting on the same stale (dead-PID) lockfile, they can each independently steal the lock from one another's fresh acquisition:

  1. Process A dies — lockfile contains dead PID A.
  2. Processes B and C both call openSync('wx'), both get EEXIST, and both read dead PID A.
  3. B unlinks and immediately re-acquires (new inode I1, fn() starts).
  4. C now executes unlinkSync(lockPath) — it unlinks B's live lockfile (I1), not the stale one.
  5. C loops, calls openSync('wx'), succeeds (new inode I2), and its fn() begins concurrently with B's.

Both B and C are now inside the critical section simultaneously — exactly the write-serialisation invariant this PR is meant to guarantee.

The standard mitigation is to verify ownership after the unlink/re-create cycle: write a nonce (or use the inode number) inside the lockfile, then re-read it after openSync succeeds and bail if it doesn't match yours. Alternatively, use fs.renameSync(tmpfile, lockPath) to perform the steal atomically (create a temp file, rename onto the lockfile — POSIX rename is atomic).


Contributor Author

Fixed in f5c737c. Replaced the unlink + openSync('wx') steal pattern with an atomic write-tmp + fs.renameSync steal. Each writer generates a random nonce, writes it into a temp file, and atomically renames onto the lockfile. After rename we re-read the lockfile and only proceed if our nonce is still there — if another stealer's rename landed after ours, we bail and retry instead of unlinking their live lock. Release also nonce-verifies before unlinking. Added a regression test that stages a lockfile with a different-writer nonce after a real steal cycle and asserts we do not retroactively unlink it.
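The steal cycle described in that fix might look like the following sketch. trySteal is the PR's function name, but this body is a hypothetical illustration of the write-tmp + rename + nonce-verify pattern, not the merged code.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";
import * as os from "node:os";
import * as crypto from "node:crypto";

// Atomic steal of a stale lockfile: write our PID + a random nonce into a
// temp file, rename it onto the lockfile (atomic on POSIX), then re-read to
// confirm our write is the one that landed. Returns the nonce on success,
// or null if another stealer's rename overwrote ours.
function trySteal(lockPath: string): string | null {
  const nonce = crypto.randomBytes(8).toString("hex");
  const tmp = `${lockPath}.${nonce}.tmp`;
  fs.writeFileSync(tmp, `${process.pid}\n${nonce}\n`);
  fs.renameSync(tmp, lockPath); // atomically replaces the stale lock
  // Ownership check: if a competing stealer renamed after us, our nonce
  // is gone. Bail and retry rather than unlinking their live lock.
  const content = fs.readFileSync(lockPath, "utf-8");
  return content.includes(nonce) ? nonce : null;
}
```

Release then performs the same nonce comparison before unlinking, so a writer never removes a lockfile it no longer owns.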

Comment on lines +12 to +15
function sleepSync(ms: number): void {
  const buf = new Int32Array(new SharedArrayBuffer(4));
  Atomics.wait(buf, 0, 0, ms);
}
Contributor

P2 Atomics.wait freezes the Node.js event loop during lock contention

Atomics.wait is a synchronous, blocking call — it stops the entire V8 event loop for the full ms duration. In a watcher process, every filesystem notification, timer, and pending I/O callback is silenced for each 25 ms retry. In the worst case (5 000 ms timeout, 200 retries), the watcher becomes completely unresponsive for up to 5 seconds before ever throwing.

A lighter alternative that avoids blocking the event loop is a simple busy-spin with process.hrtime.bigint():

function sleepSync(ms: number): void {
  const end = process.hrtime.bigint() + BigInt(ms) * 1_000_000n;
  while (process.hrtime.bigint() < end) { /* spin */ }
}

This keeps each retry short and doesn't starve unrelated callbacks (though it does keep the CPU busy, which is acceptable for the brief per-retry duration).


Contributor Author

Fixed in f5c737c. Replaced Atomics.wait with a short process.hrtime.bigint busy-spin per your suggestion. The 25ms retry interval keeps CPU burn negligible while letting pending FS events, timers, and I/O callbacks in watcher processes keep firing during contention.

@github-actions
Contributor

github-actions Bot commented Apr 21, 2026

Codegraph Impact Analysis

12 functions changed, 14 callers affected across 5 files

  • sleepSync in src/domain/graph/journal.ts:17 (5 transitive callers)
  • isPidAlive in src/domain/graph/journal.ts:24 (5 transitive callers)
  • AcquiredLock.fd in src/domain/graph/journal.ts:36 (5 transitive callers)
  • AcquiredLock.nonce in src/domain/graph/journal.ts:37 (5 transitive callers)
  • trySteal in src/domain/graph/journal.ts:51 (5 transitive callers)
  • acquireJournalLock in src/domain/graph/journal.ts:94 (8 transitive callers)
  • releaseJournalLock in src/domain/graph/journal.ts:162 (8 transitive callers)
  • sweepStaleTmpFiles in src/domain/graph/journal.ts:180 (8 transitive callers)
  • withJournalLock in src/domain/graph/journal.ts:210 (9 transitive callers)
  • appendJournalEntries in src/domain/graph/journal.ts:277 (0 transitive callers)
  • writeJournalHeader in src/domain/graph/journal.ts:297 (4 transitive callers)
  • appendJournalEntriesAndStampHeader in src/domain/graph/journal.ts:327 (4 transitive callers)

Address two Greptile review issues on the journal lockfile:

- P1 TOCTOU: when two stealers observed the same stale (dead-PID)
  holder, one's unlink could cross the other's fresh openSync('wx')
  acquisition, admitting both writers into the critical section. Replace
  the unlink + openSync('wx') pattern with an atomic write-tmp + rename
  steal, then verify via a random nonce. If another stealer's rename
  landed after ours, we bail and retry instead of unlinking their live
  lockfile. Release now also nonce-verifies before unlinking.

- P2 event-loop blockage: replace Atomics.wait with a short hrtime
  busy-spin so pending FS events and timer callbacks in the watcher
  keep firing during the 25ms retry window.

Add a regression test for the stale-lock steal race that asserts we
never unlink a lockfile whose nonce does not match our own.

Impact: 7 functions changed, 10 affected
@carlos-alm
Contributor Author

@greptileai

Comment on lines +98 to +108
try {
  const fd = fs.openSync(lockPath, 'wx');
  try {
    fs.writeSync(fd, `${process.pid}\n${nonce}\n`);
  } catch {
    /* PID stamp is advisory; fd is still exclusive */
  }
  return { fd, nonce };
} catch (e) {
  if ((e as NodeJS.ErrnoException).code !== 'EEXIST') throw e;
}
Contributor

P1 Silent writeSync failure voids the nonce and breaks mutual exclusion

If fs.writeSync throws (e.g. ENOSPC, I/O error), the lockfile is created by openSync('wx') but remains empty — the nonce is never written. Two consequences compound:

  1. releaseJournalLock reads the empty file, content.includes(lock.nonce) is false, and the lockfile is orphaned rather than unlinked.
  2. Any concurrent waiter reads the empty file, computes Number('') → 0, then calls isPidAlive(0) which returns false (the pid <= 0 guard). With holderAlive = false, it immediately calls trySteal — stealing a live, active lock and breaking mutual exclusion.

The comment "PID stamp is advisory; fd is still exclusive" only holds when there are no concurrent waiters, but a journal with a watcher + build process is exactly the concurrent scenario this lock is meant to protect.

Fix: when the write fails, release the fd and unlink synchronously (we still hold the exclusive fd at that point) and retry, rather than proceeding into fn() with an un-stamped lockfile.


Contributor Author

Fixed in e17da0a. When the nonce stamp writeSync throws, acquireJournalLock now closes the fd, unlinks the empty lockfile, and retries instead of returning with an un-stamped lockfile. This prevents concurrent waiters from reading an empty nonce, treating isPidAlive(0) as dead, and stealing the live lock.
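The corrected acquisition step might look like this sketch. tryAcquire and the single-attempt shape are assumptions for illustration; the point is the failure path: if the nonce stamp cannot be written, surrender the lock entirely rather than holding it un-stamped.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";
import * as os from "node:os";

// One acquisition attempt. Returns the open fd on success, or null when the
// lock is contended or the nonce stamp failed (caller retries either way).
function tryAcquire(lockPath: string, nonce: string): number | null {
  let fd: number;
  try {
    fd = fs.openSync(lockPath, "wx"); // exclusive create
  } catch (e) {
    if ((e as NodeJS.ErrnoException).code === "EEXIST") return null; // held
    throw e;
  }
  try {
    fs.writeSync(fd, `${process.pid}\n${nonce}\n`);
  } catch {
    // ENOSPC / I/O error: an empty lockfile reads as PID 0 (dead) to
    // waiters and invites a steal of our live lock. Close and unlink
    // while we still hold the exclusive fd, then report failure.
    fs.closeSync(fd);
    try { fs.unlinkSync(lockPath); } catch { /* already swept */ }
    return null;
  }
  return fd;
}
```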

Integrates main's appendJournalEntriesAndStampHeader and mkdirSync cleanup
with this PR's lockfile serialization. The new function is also wrapped in
withJournalLock so it serializes with appendJournalEntries and writeJournalHeader
against the same journal file. Also addresses Greptile P1: silent writeSync
failure in acquireJournalLock now releases the fd + unlinks the empty lockfile
and retries, preventing empty-lockfile mutual-exclusion breakage.

Impact: 10 functions changed, 16 affected
…ntry (#1002)

Addresses Greptile P2: crash-mid-steal in trySteal leaves
.codegraph/changes.journal.lock.<nonce>.tmp files behind. Without cleanup
they accumulate silently across crash cycles.

Adds sweepStaleTmpFiles called at the top of withJournalLock which removes
any changes.journal.lock.*.tmp older than LOCK_STALE_MS. The age filter
avoids racing an in-flight steal on another process.
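The sweep described above might be sketched as follows, with LOCK_STALE_MS standing in for the PR's actual constant. The age filter is what keeps the sweep from deleting a temp file another process is about to rename.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";
import * as os from "node:os";

const LOCK_STALE_MS = 30_000; // assumed to match the lock's stale threshold

// Remove orphaned steal temp files (<lock>.<nonce>.tmp), but only ones old
// enough that no live steal on another process can still be mid-rename.
function sweepStaleTmpFiles(lockPath: string): void {
  const dir = path.dirname(lockPath);
  const prefix = `${path.basename(lockPath)}.`;
  let names: string[];
  try {
    names = fs.readdirSync(dir);
  } catch {
    return; // journal dir missing: nothing to sweep
  }
  const now = Date.now();
  for (const name of names) {
    if (!name.startsWith(prefix) || !name.endsWith(".tmp")) continue;
    const full = path.join(dir, name);
    try {
      if (now - fs.statSync(full).mtimeMs > LOCK_STALE_MS) fs.unlinkSync(full);
    } catch {
      /* raced another sweeper or an in-flight steal: skip */
    }
  }
}
```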

Impact: 2 functions changed, 10 affected
@carlos-alm
Contributor Author

Addressed Greptile's review feedback from the 5/5 review:

  • P1 silent writeSync failure (comment 3122282730): Fixed in e17da0a. When the nonce stamp fails, acquireJournalLock now closes the fd, unlinks the empty lockfile, and retries instead of proceeding with an un-stamped lockfile that concurrent waiters would treat as stale.
  • P2 orphaned .tmp files on crash mid-steal: Fixed in 7534d65. Added sweepStaleTmpFiles at the top of withJournalLock that removes changes.journal.lock.*.tmp files older than LOCK_STALE_MS. The age filter prevents racing an in-flight steal on another process. Regression test added.
  • P2 test is single-process / does not exercise inter-process concurrency: Kept the single-process format-integrity test as-is. A true multi-process test would require spawning child processes with cross-platform handling (especially on Windows CI, where child_process.spawnSync semantics differ). That is significant complexity for marginal coverage gain over the P1 regression test ("does not unlink another writer's lockfile after a stale-lock steal race"), which already validates the core invariant the lock is meant to protect. Happy to revisit if the watcher+build scenario ever regresses.

@carlos-alm
Contributor Author

@greptileai


Development

Successfully merging this pull request may close these issues.

bug(journal): concurrent appends are not atomic — corrupted under parallel builds

1 participant